CMDA 4654 Project 2
Julia Brady, Priya Bhat, Mason Colt, Matt Nissen, Dylan Fair, Jamal Mani
2025-11-22
What is Least Angle Regression?
- A regression algorithm for high-dimensional data
- Builds the model incrementally like forward selection
- But less greedy and more statistically
efficient
- Computes the entire LASSO path with a small
modification
- Produces smooth, piecewise-linear coefficient trajectories
Why Do We Need LARS?
Forward Selection Problems:
- Commits too strongly to the first chosen variable
- Struggles with correlated predictors
- Greedy -> unstable -> can miss important variables
LARS Fixes This:
- Takes controlled, geometric steps
- Adjusts direction whenever correlations change
- Never “overcommits” too early
- Fair to correlated predictors
Key Idea
LARS moves in the direction that forms equal angles
with all predictors most correlated with the residual.
- A balanced “least angle” update
- A clear sequence of variable entry
- A smooth coefficient path
Setup and Notation
We model:
\[
\mu = X\beta
\]
At any step:
Residual:
\[
r = y - X\beta
\]
Correlations with residual: \[
c_j = x_j^\top r
\]
Active set: \[
A = \{ j : |c_j| = \max_k |c_k| \}
\]
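These quantities are simple to compute directly. A minimal NumPy sketch with synthetic data (the slides use R; this Python version and its variable names are ours, for illustration only):

```python
import numpy as np

# Synthetic standardized design: 50 observations, 4 predictors.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(50)

beta = np.zeros(4)                            # start from the null model
r = y - X @ beta                              # residual r = y - X beta
c = X.T @ r                                   # correlations c_j = x_j^T r
C = np.abs(c).max()                           # largest absolute correlation
A = np.flatnonzero(np.isclose(np.abs(c), C))  # active set

print(A)  # -> [0]: the predictor that generated y is most correlated
```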
The Equiangular Direction
To update the model, LARS finds a vector \(u_A\) such that:
- Every active predictor makes the same angle with
\(u_A\)
- Each active predictor has the same correlation with
the update
Mathematically:
\[
u_A = X_A w_A, \quad
w_A = \frac{G_A^{-1} 1_A}{\sqrt{1_A^\top G_A^{-1}1_A}}
\]
- \(X_A\) is the matrix of active
predictors
- \(G_A = X_A^\top X_A\)
- \(1_A\) is a vector of ones
This ensures a balanced movement among active
predictors.
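A quick numeric check of these formulas (a NumPy sketch; the two-variable active set here is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
X /= np.linalg.norm(X, axis=0)         # unit-norm columns, as LARS assumes

A = [0, 1]                             # pretend these two are active
XA = X[:, A]
GA = XA.T @ XA                         # Gram matrix G_A
ones = np.ones(len(A))                 # 1_A
GA_inv_1 = np.linalg.solve(GA, ones)   # G_A^{-1} 1_A
AA = 1.0 / np.sqrt(ones @ GA_inv_1)    # (1_A^T G_A^{-1} 1_A)^{-1/2}
wA = AA * GA_inv_1                     # w_A
uA = XA @ wA                           # equiangular direction u_A

print(XA.T @ uA)                       # two equal inner products (= AA)
print(np.linalg.norm(uA))              # u_A has unit length
```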
Updating the Model
The fitted values update via:
\[
\mu \leftarrow \mu + \gamma\, u_A
\]
Where the step size \(\gamma\) is
the largest value such that:
- All active predictors remain tied
- A new predictor reaches the same correlation level
Thus the active set expands exactly when it should.
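The step size \(\gamma\) has a closed form: the smallest positive value among two candidate ratios per inactive predictor. A small sketch with made-up correlation numbers rather than real data:

```python
import numpy as np

def lars_step_size(c, a, A_A, active):
    """Smallest positive gamma at which some inactive predictor ties the
    active correlations: the minimum positive value of
    (C - c_j)/(A_A - a_j) and (C + c_j)/(A_A + a_j) over inactive j,
    where a_j = x_j^T u_A."""
    C = np.abs(c[list(active)]).max()
    candidates = []
    for j in range(len(c)):
        if j in active:
            continue
        for val in ((C - c[j]) / (A_A - a[j]),
                    (C + c[j]) / (A_A + a[j])):
            if val > 1e-12:
                candidates.append(val)
    return min(candidates)

# Toy numbers: predictor 0 is active with correlation 1.0; predictor 1
# has correlation 0.5 and inner product 0.2 with the direction u_A.
gamma = lars_step_size(np.array([1.0, 0.5]), np.array([1.0, 0.2]),
                       A_A=1.0, active={0})
print(gamma)   # -> 0.625: predictor 1 ties predictor 0 after this step
```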
Algorithm Steps
Initial Equation: \(y = \beta_{0}\)
Final Equation: \(y = \beta_{0} + \beta_{1}x_1
+ \beta_{2}x_2 + ... + \beta_{n}x_n\)
- Take the correlation of residuals with every predictor variable and
find the maximum.
- Add the highest-correlation predictor \(x_j\) to the equation with coefficient
\(\gamma_j\): \(r = y - \beta_0 - \gamma_j x_j\), \(y = \beta_0 + \gamma_j x_j\)
- Increase \(\gamma\) in the
direction of its correlation with y (positive or negative), taking
residuals along the way, until \(Cor(x_j, r) =
Cor(x_k, r)\) for some other predictor \(x_k\).
- Continue moving along \((x_j, x_k)\) by increasing \((\gamma_j, \gamma_k)\)
in their least squares direction, taking residuals along the way, until
some predictor \(x_m\) becomes as correlated with the residual as the
active predictors.
- Repeat until all predictors are included in the model.
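The steps above can be sketched end to end. This is our own minimal NumPy version, not the R `lars` package's code; it assumes centered, unit-norm predictor columns and a centered response (so no intercept is needed), and its final step lands exactly on the full least squares fit:

```python
import numpy as np

def lars_sketch(X, y):
    """Plain LARS (no lasso modification) on standardized X, centered y."""
    n, p = X.shape
    beta, mu = np.zeros(p), np.zeros(n)
    active, inactive = [], list(range(p))
    for _ in range(p):
        c = X.T @ (y - mu)                       # current correlations
        C = np.abs(c).max()
        j = max(inactive, key=lambda k: abs(c[k]))   # newly tied predictor
        active.append(j)
        inactive.remove(j)
        s = np.sign(c[active])                   # signs of active correlations
        XA = X[:, active] * s                    # sign-adjusted active matrix
        GA_inv_1 = np.linalg.solve(XA.T @ XA, np.ones(len(active)))
        AA = 1.0 / np.sqrt(GA_inv_1.sum())
        w = AA * GA_inv_1
        u = XA @ w                               # equiangular direction
        if inactive:                             # step until the next tie
            a = X.T @ u
            gamma = min(v for k in inactive
                        for v in ((C - c[k]) / (AA - a[k]),
                                  (C + c[k]) / (AA + a[k])) if v > 1e-12)
        else:                                    # last step: go to OLS
            gamma = C / AA
        mu += gamma * u
        beta[active] += gamma * s * w
    return beta

# Sanity check: after all steps, LARS reaches the full OLS solution.
rng = np.random.default_rng(3)
X = rng.standard_normal((60, 5))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)
y = rng.standard_normal(60)
y -= y.mean()
beta = lars_sketch(X, y)
ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta, ols))   # -> True
```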
Algorithm
- Take the correlation of residuals with every predictor variable and
find the maximum.
At the beginning of the algorithm, we set our intercept \(\beta_0\) equal to the average of our
y-vector, denoted by \(\bar y\).
Residuals, then, are given by \(r = y -
\beta_0 = y - \bar y\)
We can denote the correlation of residuals with each variable as
\(Cor(r, \begin{bmatrix} x_1 \\ x_2 \\ \vdots
\\ x_n \end{bmatrix}) = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n
\end{bmatrix}\)
Let \(c_{\max} = \max_j |c_j|\).
We select the predictor \(x_j\)
corresponding to \(c_{max}\) as our
first active predictor.
Algorithm
- Add the highest-correlation predictor \(x_j\) to the equation with coefficient
\(\gamma_j\): \(r = y - \beta_0 - \gamma_j x_j\)
\(\gamma_j\) is a temporary step
size along predictor \(x_j\). We will
compute its value and assign it to \(\beta_j\).
Algorithm
- Increase \(\gamma\) in the
direction of its correlation with y (positive or negative), taking
residuals along the way, until \(Cor(x_j, r) =
Cor(x_k, r)\) for some other predictor \(x_k\).
The direction of travel is given by \(u\), a unit vector along \(x_j\)
(with standardized predictors, \(u = \pm x_j\), taking the sign of the
correlation). The residual vector, continually updated as \(\gamma\)
increases, is given by \(r(\gamma) = r_0 - \gamma u\),
where \(r_0 = y - \bar y\).
To find \(\gamma_j\), consider all
predictors not added to the model yet. Solve
\(C = Cor(x_j, r(\gamma))=Cor(x_i,
r(\gamma))\). This will give us \((n-1)\) values of \(\gamma\) for \(n\) predictive variables.
The minimum \(\gamma\) value is our
\(\beta_j\). The corresponding
predictor is the next active predictor, which we will add in the next
step and denote \(x_k\).
Now, we have \(y = \beta_0 + \beta_j
x_j\).
Algorithm
- Continue moving along \((x_j, x_k)\) in their least squares direction,
taking residuals along the way, until some predictor \(x_m\) becomes as
correlated with the residual as the active predictors.
Because we now have 2 predictors, our direction vector \(u\) will be updated according to both.
Also, we will move \(\gamma_k\) to
update our fitted values and residuals.
Then, for every predictor we’re not already using, we find the
minimum \(\gamma\) value again. It will
update each of our coefficients and the corresponding predictor \(x_m\) will be added to the model next.
Now, we have our updated \(\beta_k\), and our equation becomes \(y = \beta_0 + \beta_j x_j + \beta_k
x_k\).
Algorithm
- Continue moving along \((x_j, x_k, x_m)\) in their least squares
direction, taking residuals along the way, until some predictor \(x_p\)
becomes as correlated with the residual as the active predictors.
Once again, our direction vector \(u\) is updated according to all the
predictors, and we will move \(\gamma_m\) to update both our fitted values
and residuals.
We find the minimum \(\gamma\) value among all non-active predictors,
update our coefficients, and add \(x_p\) to the model in the next
iteration.
- We will repeat this until all predictors are included in the
model.
This movement is piecewise-linear and
geometrically optimal.
When LARS is Appropriate
High-dimensional data with many predictors:
Excellent for datasets where the number of predictors exceeds the number
of observations (\(p >
n\)).
Variable selection is a priority: Provides a
clear and complete solution path showing the sequence and importance of
features entering the model.
Computational efficiency matters: Computes the
full regularization path at roughly the cost of a single least squares
fit, \(O(k^3 + pk^2)\) for \(k\) steps and \(p\) predictors.
Collinearity among predictors: Effectively
manages correlated predictors by moving them together, helping to
identify meaningful feature groups.
Real-World Examples: LARS is Appropriate
Genomics: Gene expression analysis with
thousands of genes but few samples (\(p \gg n\)).
Financial modeling: Portfolio optimization with
hundreds of assets and economic indicators.
Medical imaging: Disease prediction from
high-dimensional MRI/CT features with limited patient data.
Marketing analytics: Customer behavior modeling
with many potential predictors (demographics, interactions,
preferences).
When LARS is NOT Appropriate
Non-linear relationships: Assumes linearity;
unsuitable for data with non-linear patterns between features and the
response.
Categorical variables dominate: Best suited for
continuous predictors; extensive categorical data may not fully leverage
LARS’s benefits.
Outliers are present: Highly susceptible to
outliers, which can significantly distort the regression path (consider
alternatives).
Temporal Dependencies: Fails to capture inherent
trends, seasonality, or autocorrelation present in time-series
data.
Very large sample size, few features: Standard
linear regression is often simpler and equally effective when
observations heavily outweigh the number of features.
Real-World Examples: LARS is NOT Appropriate
Image recognition: Deep non-linear relationships
between pixels and object classes.
Stock price prediction: Time-series with
autocorrelation, trends, and temporal dependencies.
Survey data with outliers: Data quality issues
where extreme responses distort the model.
Simple A/B testing: Few variables with large
sample sizes where standard regression suffices.
LARS using iris Dataset
LARS requires us to have:
- a numeric vector for the response (y)
- a matrix of predictor variables (x)
library(lars)
y <- iris$Sepal.Length
x <- as.matrix(iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")])
fit <- lars(x, y, type = "lar")
Model Summary
##
## Call:
## lars(x = x, y = y, type = "lar")
## R-squared: 0.859
## Sequence of LAR moves:
## Petal.Length Sepal.Width Petal.Width
## Var 2 1 3
## Step 1 2 3
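The same fit can be cross-checked in Python with scikit-learn's `Lars` estimator (a sketch, assuming scikit-learn is installed; the full LAR path ends at the ordinary least squares fit, so the R² should match the R summary above):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import Lars

iris = load_iris(as_frame=True).frame
y = iris["sepal length (cm)"]
# Same column order as the R predictor matrix above.
X = iris[["sepal width (cm)", "petal length (cm)", "petal width (cm)"]]

model = Lars().fit(X, y)
print(model.coef_)          # one coefficient per predictor
print(model.score(X, y))    # R-squared, about 0.859 as in the R summary
```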
iris LARS Plot
We can see that Petal.Length has the strongest
correlation with Sepal.Length. Sepal.Width has the 2nd
strongest correlation, and Petal.Width is the least
correlated.

LARS using mtcars Dataset
In this model, we are looking to predict fuel efficiency:
- Miles per Gallon (mpg) as our response variable (y)
- Matrix of all predictor variables (x)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Model Summary
## LARS/LAR
## Call: lars(x = x_val, y = y_val, type = "lar")
## Df Rss Cp
## 0 1 1126.05 130.3246
## 1 2 992.54 113.3155
## 2 3 378.79 27.9310
## 3 4 194.17 3.6454
## 4 5 190.76 5.1607
## 5 6 184.29 6.2386
## 6 7 170.09 6.2177
## 7 8 169.29 8.1030
## 8 9 157.32 8.3992
## 9 10 151.71 9.6000
## 10 11 147.49 11.0000
mtcars LARS Plot
Here we see that wt and cyl enter
the model first, meaning they are more strongly correlated to
mpg than other variables such as disp
and qsec.

Key Findings
Through these two examples, we can see that LARS makes it very
easy to identify which variables have the most influence over our
response variable.
It shows us how important each predictor is, while also showing
us where each variable enters the model.
LARS also adds variables into the model in a balanced way without
overfitting.
Comparing LARS to Other Regression Methods
Iris: LARS vs Other Regression Methods
We predict Sepal.Length from the three other numeric
iris variables and compare four models:
- OLS Test MSE: 0.079
- Stepwise Test MSE: 0.079
- Lasso Test MSE: 0.079
- LARS Test MSE: 0.079
Interpretation:
- All four models have very similar test error because iris is small
and low-dimensional.
- Lasso achieves the (marginally) lowest MSE due to coefficient shrinkage
slightly reducing variance; the differences fall below the rounding shown.
- LARS nearly matches Lasso’s performance because its solution path
closely approximates the Lasso path.
- Stepwise and OLS perform similarly since most predictors are
genuinely informative.
| Model    | Test MSE |
|----------|----------|
| OLS      | 0.079    |
| Stepwise | 0.079    |
| Lasso    | 0.079    |
| LARS     | 0.079    |
Iris: Lasso Cross-Validation Curve
Lasso selects its penalty parameter λ using cross-validation.
- Small λ → more flexible model
- Large λ → stronger shrinkage
- The dashed vertical line marks the λ minimizing CV error
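The curve above comes from cross-validating over a grid of penalty values. A hedged Python sketch using scikit-learn's `LassoCV` (the slides' own R code is not shown, so this is an assumed equivalent, not the code behind the plot):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LassoCV

data = load_iris()
y = data.data[:, 0]        # Sepal.Length as the response
X = data.data[:, 1:]       # the other three measurements

# 5-fold cross-validation over an automatic grid of penalties.
model = LassoCV(cv=5, random_state=0).fit(X, y)
print(model.alpha_)        # the penalty minimizing CV error
```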

Iris: LARS Coefficient Paths
The LARS coefficient path shows:
- The order predictors enter the model
- How quickly coefficients grow
- The piecewise-linear nature of the LARS solution
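The entry order and breakpoints behind such a plot can be computed directly with scikit-learn's `lars_path` (a sketch; plotting code is omitted, and the entry order may differ slightly from R's `lars` if the standardization differs):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import lars_path

data = load_iris()
y = data.data[:, 0]                     # Sepal.Length
X = data.data[:, 1:]                    # the other three measurements

alphas, active, coefs = lars_path(X, y, method="lar")
print(active)                           # indices in order of entry
print(coefs.shape)                      # (n_features, n_breakpoints)
```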

mtcars: LARS vs Other Regression Methods
We predict mpg in the mtcars dataset using all other
variables:
- OLS Test MSE: 5.59
- Stepwise Test MSE: 5.55
- Lasso Test MSE: 5.3
- LARS Test MSE: 5.54
Interpretation:
- mtcars is extremely small (32 rows), so each train/test split is
noisy.
- All four models have almost identical MSE on this split.
- Lasso and LARS again behave very similarly in terms of
prediction.
- With such a tiny dataset, shrinkage and selection provide limited
gains over OLS, but they illustrate the methods’ behavior.
| Model    | Test MSE |
|----------|----------|
| OLS      | 5.5893   |
| Stepwise | 5.5515   |
| Lasso    | 5.3049   |
| LARS     | 5.5410   |
Takeaways from Baby Datasets
- On simple, low-dimensional datasets, all four models achieve very
similar performance.
- Lasso and LARS consistently match or slightly beat OLS and
Stepwise.
- LARS is especially appealing when:
- the number of predictors is large,
- predictors are correlated,
- and we care about the full coefficient path for interpretation.
- These baby examples confirm that LARS behaves like Lasso in terms of
predictive accuracy, while providing an efficient and interpretable
solution path.
Main Dataset
This project uses the Burke et al. (2022) global urban soil black
carbon dataset, obtained from the Knowledge Network for Biocomplexity
(KNB) at: https://knb.ecoinformatics.org/view/urn:uuid:1651eeb1-e050-4c78-8410-ec2389ca2363
The dataset pulls together measurements of black carbon in urban
soils from cities around the world. Each row includes details like
latitude/longitude, elevation, precipitation, soil temperature at
different depths, land-cover type, population info, and notes from the
original studies. The main sheet (“Urban Black Carbon”) contains 600+
observations and about 65 variables, giving us a wide mix of
environmental and geographic predictors.
Because many of these variables move together (climate, location,
soil traits, etc.), the dataset naturally has clusters of correlated
features, which makes it a solid fit for demonstrating Least Angle
Regression (LARS).
Data Dictionary
We removed variables with more than 90% missing values to avoid unstable
predictors and to ensure a consistent sample size across all variables. This
threshold preserved essential environmental predictors while excluding
sparse fields that contained too little information to contribute to
modeling.
BC vs Depth

Takeaways
High BC values occur almost entirely near the surface (0–1 cm),
reflecting strong urban deposition.
BC drops sharply with depth, flattening to low levels beyond ~4
cm.
Variance is very large at shallow depths, but nearly zero deeper
in the profile.
Pattern indicates a nonlinear depth–BC relationship and potential
heteroskedasticity.
Supports LARS: depth correlates with other environmental factors,
and regularization helps stabilize selection in the presence of such
gradients.
Predictor Correlation Heatmap

Takeaways
Strong spatial correlation between latitude and
longitude.
The two soil temperature variables are highly
correlated.
Elevation and precipitation follow the same environmental
gradient.
Overall: clear multicollinearity clusters, meaning several
predictors share similar information. This supports using LARS.
LARS on Large Data
Test-Set Performance of the LARS Model
| Model              | Test MSE |
|--------------------|----------|
| LARS (Cp-selected) | 153.7056 |
Note: On held-out soil samples, the model’s predictions differ
from the observed black carbon values by about 12.4 mg/g on
average (the square root of the test MSE).
This level of error is expected for this dataset, because black
carbon concentrations are extremely variable near the soil surface.
LARS Coefficient Path
The coefficient path illustrates:
- the order in which predictors enter the model
- the strength of their standardized effects
- how LARS moves piecewise-linearly through predictor space

Interpretation
Depth variables are the first to enter the model, with
depth_range_cm entering immediately and taking on a large
negative coefficient. This confirms that black carbon decreases strongly
with depth range, consistent with the earlier exploratory
plots.
Latitude and elevation enter early as well, both with strong
positive standardized coefficients. This indicates that locations at
higher latitudes and higher elevations tend to have higher black carbon
values, after adjusting for depth.
Climate variables (precipitation and soil temperatures) enter in
the middle of the path with moderate positive effects. Their smoother,
slower coefficient growth reflects their weaker marginal relationship
after the major depth and geographic patterns are accounted
for.
Interpretation
Longitude shows a small negative effect, entering later and
remaining relatively flat. This suggests that east-west location
explains little additional variation once depth, latitude, and elevation
are included.
The dashed vertical line marks the Cp-selected step, which
represents the best balance between model complexity and predictive
accuracy. Beyond this point, adding additional predictors primarily
increases variance without improving model fit.
Overall, LARS highlights a small set of dominant predictors
(depth range, latitude, and elevation) while shrinking or delaying
weaker, highly correlated variables, giving a clear picture of the main
environmental drivers of black carbon.